Bioinformatics – A Genetic and Epigenetic Taster

This workshop aims to give you a taste of bioinformatics by utalising some basic tools and analyses to explore and interpret both genetic and epigenetic data.

We’re going to be expanding upon some of the topics that you’ve covered in genetics and epigenetics over the last couple of days. As such this workshop consists of two parts:

As always please don’t hesitate to ask your questions and prompt discussion!


1. Population Genetics

We will investigate genetic variation (SNPs) from several populations which have had individuals sequenced as part of the 1000 Genomes project.

Note: The majority of code and data for this section is borrowed/modified from the course GENE360 maintained by Associate Professor Mik Black. You can find the full resource here: https://github.com/mikblack/GENE360-PopDiv

Overview

The objective of this module is to give you some familiarity with the types of data and tools that are commonly used for genetic studies of population diversity.

The data we will be using are publicly available high-throughput sequencing data from the 1000 Genomes Project. The main web page for the 1000 Genomes Project is:

http://1000genomes.org

The following page provides background information about the 1000 Genomes Project (and is worth reading, particularly the information about the populations being used in the study):

http://www.1000genomes.org/about

We will be using the R and RStudio software applications to analyze the data.

R: https://cran.rstudio.com/ RStudio: https://www.rstudio.com/product/rstudio/download/

Getting started - loading the data

The first data set we will look at contains genetic data on just a few loci (specific points in the genome) across the various populations being studied. Looking at this small data set will give you a feel for what we mean by “genotype data” for a particular locus.

Open RStudio and read in the first data set using the following command:

# Read in the SNP data
snpData = read.table('data/GENE360snpData.txt',sep='\t',header=T)

Looking at SNP frequencies

There are nine columns in the data set that relate to specific single nucleotide polymorphisms (SNPs) in the genome - we have the genotype data for each SNP for every individual in the data set.

We can use the “table” command to summarize the genotype information for each SNP:

# Make a genotype frequency table for the first SNP
table(snpData$rs3826656)
AA AG GG
1112 952 440

We can also calculate the proportions associated with each genotype:

# Calculate proportions
prop.table(table(snpData$rs3826656))
AA AG GG
0.4440895 0.3801917 0.1757188

and examine differences in genotype frequencies across populations:

# Create contingeny table - genotypes across populations
table(snpData$Population, snpData$rs3826656)
/ AA AG GG
AFR 445 199 17
AMR 225 109 13
EAS 47 224 233
EUR 312 161 30
SAS 83 259 147
# And calculate proportions (rounded)
round(prop.table(table(snpData$Population, snpData$rs3826656),1),2)
/ AA AG GG
AFR 0.67 0.30 0.03
AMR 0.65 0.31 0.04
EAS 0.09 0.44 0.46
EUR 0.62 0.32 0.06
SAS 0.17 0.53 0.30

These results can be plotted as a bar plot:

snpFreqs = t(prop.table(table(snpData$Population, snpData$rs3826656), 1))
barplot(snpFreqs, beside=TRUE, legend.text=TRUE, ylim=c(0,1))

These analyses can be repeated for different SNPs by changing the SNP ID (e.g., rs3826656) in the above commands.

Population diversity

The second data set that we will look at can be used to investigate population (genetic) diversity. These SNPs were chosen because they are “ancestry informative” - variation in their genotype frequencies tends to be associated with differences between population groups.

# Load ancestry data
snpAns = read.table('data/GENE360snpAncestry.txt', header=T, sep='\t')
# Check dimensionality of data
dim(snpAns)
## [1] 2504 2305

Now we have a lot more data: 2504 individuals, and 2302 SNP genotypes for each. Note that there are 2305 columns in the data set, but the first three relate to non-SNP data.

To examine population diversity, we need to do two things:

  1. create a data object of ONLY the SNP genotype data (i.e., remove the first three columns).
  2. convert the genotypes to allele counts (e.g., TT, AT, AA to 0, 1, 2 - count the number of A’s).

Step 1:

# Create object containing only the SNP data - remove the first three columns
snpAnsDat = snpAns[,-c(1,2,3)]

Step 2:

# Load a custom function to convert to allele counts
source('alleleCounts.R')

# Apply the function to the genotype data, one column (SNP) at a time
# This will take 30 seconds or so...
snpAnsCount = apply(snpAnsDat, 2, alleleCounts)

Now we have a data set of just the SNP data, with genotypes converted to allele counts.

So, what are we going to do with this new data set?

Dimension reduction (principal components analysis)

In terms of examining population diversity, we have 2302 dimensions of data available - one dimension for each SNP.

Rather than trying to comprehend this huge amount of data in 2302-dimensional space, genetics researchers often use a statistical tool called Principal Components Analysis (PCA) to reduce the dimensionality of the data.

The idea is to find the most important variation in the data, and examine the samples in terms of that variation, ignoring the rest. In practice, this works fairly well, because genetic differences between populations provide a strong (and relatively consistent) source of variation across genomic loci (i.e., SNPs). Rather than looking at 2302 dimensions of data, we end up looking at variation across just 2 or 3 dimensions - each dimension is defined by a combination of SNPs which vary in a similar way across the individuals in the study.

We can plot the principal components and colour the points based on the population each sample belongs to.

Perform the PCA to create the first three components needed for plotting:

# use flashpcaR package for fast PCA analysis
# perform PCA
r <- flashpca(snpAnsCount, do_loadings=TRUE, verbose=TRUE, stand="binom", ndim=3)

PCA plots

From the plots we can see that samples from the same population tend to cluster together, and that the first three principal components do a reasonable job of capturing the genetic diversity between the populations.

With the scatterplot3d package, you can plot the first three principal components at once (i.e., combining the information from the three scatterplots above). This shows that the European (EUR), East Asian (EAS) and South Asian (SAS) super-populations are relatively homogeneous, while the Ad-Mixed American (AMR) and African (AFR) super-populations exhibit greater variation, suggesting admixture within these groups.

# 3D plotting of PCA results
scatterplot3d(r$vectors[,1], r$vectors[,2], r$vectors[,3], color=alpha(as.numeric(snpAns$Population), 0.5), pch=16,
              cex.symbols=0.5, xlab="PC1", ylab="PC2", zlab="PC3")
legend(-3.5, 2.5, legend = c('AFR', 'SAS', 'EAS', 'AMR', 'EUR'), col = c('black', 'green', 'cyan', 'red', 'blue'), pch = 16, bty = 'n')

Interactive 3D plot

We can take this a step further and using the plotly package we can visualise the three components in 3D space interactively - give it a spin!!

Phylogenetic trees

Another way to visualise genetic similarity is via “phylogenetic trees”, also known more generally as “dendrograms”.

These tree diagrams group items together based on similarity scores. In this setting our “items” are the populations, and the scores are calculated based on the genetic similarities between each pair of populations.

To calculate similarities, we need to create a set of “average” genotypes for each population. One way to do this is to calculate the frequency of the major homozygote for each SNP in each population.

# Load custom function for calculating major homozygote frequency:
source('calcMajorFreq.R')

# Apply this function to the SNP count data
# Takes a few seconds
popFreqs = calcMajorFreq(snpAnsCount, snpAns$Population)

Plot the tree:

# Make a cluster tree based on these frequencies
plot(hclust(dist(popFreqs)),hang=-1)

The way in which the “leaves” of the tree cluster, reflects the similarity between the items analysed. In this case the populations are grouped based on their genetic similarity, as measured by these particular loci. Here it appears that the AMR (Ad-Mixed American), and SAS (South Asian) super-populations are most genetically similar, then EUR (European), followed by EAS (East Asian). The AFR (African) super-population is the most genetically dissimilar relative to the others. It is likely that these groupings reflect (to some degree) migration patterns and shared ancestry experienced by these populations.


2. Epigenetics - CpGs and DNA methylation

We will explore some DNA methylation data and demonstrate how specific markers may allow the discrimination of both tissues and cells.

Note: This section borrows from previous epigenetic workshops, see below for the full resources:

Overview

Depending on which genes are “turned on and off” the consequences can vary. This is particularly important in the context of differentiation between such things as tissues and cells, which have very different developmental origins and purposes, thus it fits that they will also differ on an epigenetic level - epigenetics is one of the major developmental drivers. It is important to remember that in this case the genetic code of the tissues and the cells of a given individual is identical, i.e. genetic variation isn’t going to help distinguish the tissues/cells of an individual… So we’re going to use epigenetics to see if we can determine what tissues we’re dealing with here.

DNA Methylation

One epigenetic mechanism for which there is a lot of data available is DNA methylation. You can think of this as an underline that occurs under two specific letters, in DNA language this is CG, but you could think of it as words with a “GN” next to each other e.g magnetic, gnat, gnome, significant.

A very basic overview of a methylation mechanic:

CpG methylation illustration

CpG methylation illustration

So to visualise methylation at a particular CG between tissues we can create a bar chart similar to the genotype one we used earlier in the population genetics section.

# create some tissue methylation data
tissue_data <- data.frame(CG_1 = c(100, 50, 0), CG_2 = c(50, 50, 100), CG_3 = c(0, 100, 50))
rownames(tissue_data) <- c("blood", "heart", "brain")
tissue_data <- as.matrix(tissue_data)

barplot(tissue_data, beside=TRUE, legend.text=TRUE, ylim=c(0,100), col = c('cadetblue', 'darkred', 'grey65'), ylab = '% methylation',
        args.legend = list(bty = 'n'))

What can we infer about the CG sites and the tissues above?

Load data

The data for this workshop has been provided for you in the Github repository. The methylation data (beta values) for 5 whole blood and 5 buccal (cheek swab) samples is what we will be working with first.

Load this into R:

# load beta/methylation data for tissues
betas <- readRDS('data/blood_buccal.RDS')

Can we distinguish between tissues with methylation?

To start with we’ll create an object called tissues which provides identification of our blood and buccal samples:

# define samples
tissues <- c(rep(0, 5), rep(1, 5))
tissues <- as.factor(tissues)
levels(tissues) <- c('Blood', 'Buccal')

In a very similar manner as before with the population genetics module we’re going to explore the variation of methylation values (beta values) by creating a tree. Again the distance between the ‘leaves’ in this tree indicates how closely related each sample is to the others, the further apart these leave the more variation that exists to distinguish one sample from another - or one tissue from another.

Here we perform hierarchical clustering on the 10 samples using all available data:

# Hierarchical Clustering
d <- t(betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main="Cluster on all probes", labels = tissues, xlab = '', sub="")

What do you conclude from the above?

Identification of the ‘best’ discriminatory markers

It is very obvious that we can separate the two tissues when looking at all available CpG sites. We should therefore be able to identify a panel of the most ‘robust’ markers with which to discriminate between the tissues - these are often referred to as biomarkers. We aim to find markers that clearly differ between the two tissues and we hope to build up a panel of all such markers. After this we’ll see what happens when we use this panel to create another tree.

First we’ll perform a quick association analysis between our tissues and each of our methylation sites individually. This sounds like it might take quite a long time, but the beauty of R means that these types of analysis can be quickly batched together. We’re going to use a function from the package minfi to test all 450K methylation markers (CpGs) and then look at some of the top sites (by association).

First we run the association testing, run dmpFinder:

# minfi dmpfinder
results <- dmpFinder(betas, pheno = tissues, type = "categorical", qCutoff = 1)
results <- na.omit(results)

Now we extract these sites into an object called top.results, we do this based upon an association statistic - a value that tells us which sites are most robustly associated with tissue type.

# filter results
top.results <- results[results$qval <= 0.00005,]
head(top.results)
intercept f pval qval
cg26297005 0.9248713 2779.165 0 2.6e-06
cg16780847 0.9185569 2179.933 0 2.6e-06
cg08780218 0.8801236 2021.894 0 2.6e-06
cg07759042 0.9810960 2003.060 0 2.6e-06
cg05350268 0.1052552 1776.359 0 3.3e-06
cg26071526 0.0753421 1444.052 0 4.4e-06

Lets look at a few methylation sites that have the ‘best’ association rankings. We can easily visualise these using box plots:

# create boxplots for 4 results
par(mfrow = c(2,2))
boxplot(betas['cg26297005',]*100 ~ tissues, ylim = c(50,100), main = 'cg26297005', ylab = '% methylation')
boxplot(betas['cg16780847',]*100 ~ tissues, ylim = c(50,100), main = 'cg16780847', ylab = '% methylation')
boxplot(betas['cg05350268',]*100 ~ tissues, ylim = c(0,60), main = 'cg05350268', ylab = '% methylation')
boxplot(betas['cg07759042',]*100 ~ tissues, ylim = c(50,100), main = 'cg07759042', ylab = '% methylation')

What can we say about the above?

Revisting our tree using specific discriminatory markers

We previously performed clustering on all CpG sites (that is nearly half a million markers!). This time we are going to cluster using only those markers that pass our defined threshold - those that best discriminate between the tissues.

First extract a smaller matrix of all significant CpG sites:

# subset the beta matrix by the top markers and cluster
sig.betas <- betas[rownames(betas) %in% rownames(top.results), ]

Now perform the clustering:

# Hierarchical Clustering
d <- t(sig.betas)
rownames(d) <- tissues
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main = "Cluster on selected top tissue discrimination CpG sites", labels = tissues, 
    xlab = "", sub = "")

Above we have our new tree - what do you see? What can we conclude about our panel of markers?

From tissues to cells

Tissues are defined by a combination of specialised cells, making them heterogeneous by nature - this means that many different cell types make up a specific tissue, and these obviously differ between all the different tissues which make up an organism. Above when we looked at blood and buccal methylation profiles we were looking at the overall contribution of different cells to the ‘global’ methylation profile of that tissue.

If we consider blood, it is well established that all cells that contribute to blood as a tissue are derived from a single origin (stem cell) - thus a lineage of these cells can be constructed.

blood cell lineage (image sourced from wikipedia)

blood cell lineage (image sourced from wikipedia)

DNA methylation in blood - potential to provide new insights in immune cell biology

So with epigenetics (DNA methylation) we have the potential to identify a “new”, previously unrecognised cell type or inflammatory response related to disease by exploring these profiles. DNA methylation can be thought of as a unique ‘barcode’ for each cell type, distinguishing them from others.

We can ‘isolate’ specific populations by cell-sorting - this could allow us to find additional CpG sites that help ‘deconvolute’/identify some of the harder to identify sub-populations in cell-types, or previously unrecognised immune cell differences [changes in lineage] associated with disease. The differential methylation may indicate the emergence of a ‘new’ cell population (i.e. when looking at a given cell population (CD4+) and you see differences they might be due to increase/decrease/presence of a specific known/unknown sub-population of cells).

Here is a quote from a recent article putting this in context:

“Our results underline that DNA hypomethylation at cg19859270 site, observed in WBCs of smokers, did not arise by direct effect of tobacco smoking compounds on methylation of DNA but rather by the enrichment of a tobacco-smoking-induced lymphocyte population in the peripheral blood
Bauer M, et al. A varying T cell subtype explains apparent tobacco smoking induced single CpG hypomethylation in whole blood. Clinical Epigenetics 7 Springer Nature, 2015.

What does this look like?

Figure 4 from Bauer *et al.,* 2015.

Figure 4 from Bauer et al., 2015.

Looking at cell-sorted data

Now to get more hands on again! We’re going to reconstruct the blood lineage illustrated above using real-world data. For this we are going to use some publicly available data for 10 different cell populations. This comes from the below paper:

Reinius, L. E. et al. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS One 7, e41361 (2012).

Let’s start by loading the Reinius data:

# read in the methylation data for 10 different cell populations
blood_data <- read.csv('data/publicbeta_data.csv', head = T, row.names = 1, as.is = F)
colnames(blood_data) <- gsub('\\.', '', colnames(blood_data))

Now we have the data we’ll create another tree:

# Hierarchical Clustering
d <- t(blood_data)
rownames(d) <- colnames(blood_data)
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main = "Cluster on selected top cell type discrimination CpG sites", 
    labels = colnames(blood_data), xlab = "", sub = "")

What do we see? What can we say about this tree? Is there anything ‘odd’ that you can see in this tree?

Again we can look at this data in a different fashion by reducing the dimensionality - using PCA:

# create a data frame
blood_data2 <- data.frame(cells = as.factor(colnames(blood_data)), t(blood_data))
# clean cell type groups
blood_data2$cells <- gsub('_.*', '', blood_data2$cells)
blood_data2$cells <- as.factor(blood_data2$cells)
# perfom the PCA
cell_pca <- PCA(blood_data2, quali.sup = 1, graph = F)
# create the plots
plot.PCA(cell_pca, habillage = 'cells', label = 'quali', axes = c(1,2))

plot.PCA(cell_pca, habillage = 'cells', label = 'quali', axes = c(1,3))

plot.PCA(cell_pca, habillage = 'cells', label = 'quali', axes = c(2,3))

plot.PCA(cell_pca, habillage = 'cells', label = 'quali', axes = c(3,4))

Which components clearly distinguish certain cell types?

Note: look at component 4 and consider the cell lineage figure and tree plot we created - there is something really interesting and important to be discussed here…

It is quite obvious that the CD19 population is the most ‘different’ based on the selected methylation markers. What’s nice is that there is a lot of biological evidence supporting this.

… and again we can use plotly to create a nice interactive 3D view:


EXTRA (time permiting): Analysis of a disease phenotype

Here is a quick worked example of looking for differential methylation in relation to disease status. In this case we’re going to be using data previously published which explored methylation profiles between multiple sclerosis patients and healthy controls:

Graves M, Benton MC, Lea R, Boyle M, Tajouri L, Macartney-Coxson D, Scott R and Lechner-Scott J. (2013) Methylation differences at the HLA-DRB1 locus in CD4+ T-Cells are associated with multiple sclerosis. Multiple Sclerosis 20(8):1033-1041

Load the data

# load data (ms vs hc)
ms_data <- readRDS('data/disease_data.RDS')
# set variable with disease status
disease <- colnames(ms_data)

Can we ‘cluster’ by phenotype before analysis - another tree plot…

# Hierarchical Clustering
d <- t(ms_data)
rownames(d) <- colnames(disease)
d <- dist(d, method = "minkowski", p = 2)
hc <- hclust(d)
plot(hc, main = "Cluster on all CpG sites", labels = disease, xlab = "", sub = "")

Hmmm… what do you think might be going on here?

Differential methylation analysis

# minfi dmpfinder
results <- dmpFinder(as.matrix(ms_data), pheno = disease, type = "categorical", qCutoff = 1)
results <- na.omit(results)
head(results)
intercept f pval qval
cg23402097 0.9909888 125.33956 6.00e-07 0.2544528
cg25822783 0.7052933 101.69612 1.50e-06 0.3346246
cg23365293 0.3762564 88.71137 2.70e-06 0.4160500
cg24274665 0.0419879 69.27511 8.30e-06 0.9448167
cg27343976 0.9436134 59.77553 1.59e-05 0.9471809
cg00660251 0.0653198 52.97146 2.67e-05 0.9471809

Plotting some disease associated methylation sites

# create boxplots of 4 CpG sites from results
par(mfrow = c(2,2))
boxplot(as.numeric(ms_data['cg25822783',])*100 ~ disease, col = alpha(c('cadetblue', 'darkred'), 0.5), main = 'cg25822783', ylab = '% meth')
boxplot(as.numeric(ms_data['cg23365293',])*100 ~ disease, col = alpha(c('cadetblue', 'darkred'), 0.5), main = 'cg23365293', ylab = '% meth')
boxplot(as.numeric(ms_data['cg24274665',])*100 ~ disease, col = alpha(c('cadetblue', 'darkred'), 0.5), main = 'cg24274665', ylab = '% meth')
boxplot(as.numeric(ms_data['cg27343976',])*100 ~ disease, col = alpha(c('cadetblue', 'darkred'), 0.5), main = 'cg27343976', ylab = '% meth')

**Tip:** don’t be fooled by scale…

# create boxplots of 4 CpG sites from results
par(mfrow = c(2,2))
boxplot(as.numeric(ms_data['cg25822783',])*100 ~ disease, ylim = c(0,100),
        col = alpha(c('cadetblue', 'darkred'), 0.5), main = 'cg25822783', ylab = '% meth')
boxplot(as.numeric(ms_data['cg23365293',])*100 ~ disease, ylim = c(0,100),
        col = alpha(c('cadetblue', 'darkred'), 0.5), main = 'cg23365293', ylab = '% meth')
boxplot(as.numeric(ms_data['cg24274665',])*100 ~ disease, ylim = c(0,100),
        col = alpha(c('cadetblue', 'darkred'), 0.5), main = 'cg24274665', ylab = '% meth')
boxplot(as.numeric(ms_data['cg27343976',])*100 ~ disease, ylim = c(0,100),
        col = alpha(c('cadetblue', 'darkred'), 0.5), main = 'cg27343976', ylab = '% meth')

These are the same plots with the y axis scaled between 0-100%, you have to be careful and keep an eye out for this sort of thing and ask yourself if the given association is ‘real’ or a potential ‘artifact’ in the data.

Let’s try another tree now…

We’re going to take all CpG sites with an association p < 0.005:

# get a list of cpgs passing set threshold
cpgs <- rownames(results[results$pval < 0.005,])
# filter the ms data for these sites
data_filtered <- ms_data[rownames(ms_data) %in% cpgs,]

Now let’s try that again:

# Hierarchical Clustering
d <- t(data_filtered)
rownames(d) <- colnames(disease)
d <- dist(d, method = "canberra", p = 2)
hc <- hclust(d)
plot(hc, main = "Cluster on selected top discriminatory CpG sites", labels = disease, 
    xlab = "", sub = "")

So we’ve found a set of markers that can clearly ‘separate’ disease phenotype in this group of samples.


EXTRA (time permiting): Forensics – a case study

Let’s suspend belief for a minute and create a fictional scenario to apply some of what we’ve learnt today.

You’re a ‘forensic bioinformatician’ and have been handed a sample from a crime scene. You are requested to identify the tissue of origin of this sample. You have a ‘database’ of tissue methylation profiles on hand and your knowledge from above.

Discuss how you think you would go about this (considering DNA methylation in this case).

If you were asked to identify a specific individual in this case how would you go about it?


Thank you